Conversation
@MagdalenaKotynia @maciejmajek please take a look and tell me if you like this reconstruction of the package structure and the new frame. If yes, I will proceed with applying this refactor to the other tasks and the changes from the 2 conflicted branches. The refactor is not complete yet; I only applied the new frame to 2 tasks as an example, so don't pay attention to the untouched parts of the code.
How should we log errors when extra calls are passed? For example, if there are errors in 3 calls but the agent does the 4th correctly, should we log the previous 3 even though the validator eventually passed?
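One way the question above could be resolved, sketched as a minimal example. The names (`SubTaskValidationError`, the shape of a tool call as a dict) mirror the PR, but this implementation is an assumption, not the PR's actual code:

```python
# Hypothetical sketch: keep errors from earlier failed tool calls even when
# a later call makes the validator pass, so they can still be logged.

class SubTaskValidationError(Exception):
    pass


def validate_with_retries(tool_calls, check):
    """Run `check` on successive tool calls; return (passed, errors).

    Errors from earlier failed calls are preserved so they can be logged
    even if a later call eventually passes."""
    errors = []
    for call in tool_calls:
        try:
            check(call)
            return True, errors  # passed, but earlier errors are kept
        except SubTaskValidationError as e:
            errors.append(str(e))
    return False, errors


def check(call):
    # illustrative check: only a 'grab_object' call is accepted
    if call["name"] != "grab_object":
        raise SubTaskValidationError(f"unexpected tool: {call['name']}")


passed, errors = validate_with_retries(
    [{"name": "look"}, {"name": "move"}, {"name": "grab_object"}], check
)
# passed is True, while errors still holds the two earlier failures
```

The benchmark could then log `errors` as warnings even for passed validators, answering "should we log the previous 3" with yes, without affecting the score.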
```diff
  trace_id=str(run_id),
  name="tool calls result",
- value=float(success),
+ value=float(score),
```
Score is already a float, so you don't need to convert it to float.
```diff
  run_id=run_id,
  key="tool calls result",
- score=float(success),
+ score=float(score),
```
```python
done_properly = 0
for validator in self.validators:
    if_success, remaining_tool_calls = validator.validate(
```
```python
for arg_name, arg_value in expected_args.items():
    if arg_name in tool_call["args"]:
        if tool_call["args"][arg_name] != arg_value:
            SubTaskValidationError(
```
Suggested change:
```diff
- SubTaskValidationError(
+ raise SubTaskValidationError(
```
```python
                f"Expected argument '{arg_name}' should have value '{arg_value}', but got '{tool_call['args'][arg_name]}'"
            )
    else:
        SubTaskValidationError(
```
Suggested change:
```diff
- SubTaskValidationError(
+ raise SubTaskValidationError(
```
```python
if arg_name not in expected_args:
    # If this argument is not required, check if it's an allowed optional argument
    if not expected_optional_args or arg_name not in expected_optional_args:
        SubTaskValidationError(
```
Suggested change:
```diff
- SubTaskValidationError(
+ raise SubTaskValidationError(
```
```python
# If optional argument has expected value, check if the value is correct
elif expected_optional_args[arg_name]:
    if expected_optional_args[arg_name] != arg_value:
        SubTaskValidationError(
```
Suggested change:
```diff
- SubTaskValidationError(
+ raise SubTaskValidationError(
```
```python
except StopIteration:
    return True, tool_calls[i:]
```
```python
self.log_error(msg="Failed to validate")
```
I suggest logging a less generic message, with more info about which tool failed to validate.
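The review comment above could be addressed along these lines. This is a sketch; `format_validation_error`, its parameters, and the tool call shape are all assumptions, not the PR's API:

```python
# Hypothetical helper: build a failure message that names the expected tool
# and lists the tool calls that were actually seen, instead of a generic
# "Failed to validate".

def format_validation_error(expected_tool: str, tool_calls: list) -> str:
    seen = [c.get("name", "?") for c in tool_calls]
    return (
        f"Validation failed: expected a call to '{expected_tool}', "
        f"but got tool calls: {seen}"
    )


msg = format_validation_error("get_interface", [{"name": "list_topics"}])
```

The validator would then call `self.log_error(msg=format_validation_error(...))` instead of the fixed string.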
MagdalenaKotynia left a comment
The proposed new structure overall looks good to me.
Please remember about handling strict/non-strict validation. Currently, both implemented validators seem to be strict and do not allow the agent to self-correct. Please also remember to handle tool call args that may have an allowed value within some range, and situations where the value does not matter.
FYI I didn't do code review, I just reviewed the structure, but I left some comments regarding the code when I noticed something by the way.
I think you misunderstood the validators-task relation. You can make validation strict or less strict by adjusting the subtasks passed to validators and the extra_tool_calls param. Please also take a look at the picture of the framework: |
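The validators-task relation described above can be sketched roughly as follows. The class names and fields here are illustrative assumptions, not the PR's exact API; the point is only that strictness comes from how subtasks are grouped and from `extra_tool_calls`:

```python
# Sketch: a Task owns validators; each validator groups subtasks and
# tolerates a configurable number of extra (wrong or redundant) tool calls.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Validator:
    subtasks: List[str]          # what this validator expects, in order
    extra_tool_calls: int = 0    # how many extra calls are tolerated


@dataclass
class Task:
    prompt: str
    validators: List[Validator] = field(default_factory=list)


# strict: every subtask must be hit with no slack
strict = Task("move the cube", [Validator(["move_arm"], extra_tool_calls=0)])

# lenient: the model may make up to 3 extra calls before failing
lenient = Task("move the cube", [Validator(["move_arm"], extra_tool_calls=3)])
```

So the same Task can be made strict or forgiving purely through configuration, without changing the validator implementations.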
Answering your question, please rename the file to interface_parser.py and move it to the generic folder.
moved docs to rai_bench: bdc58e6

saving errors from extra tool calls: 010fc5d

removed default empty dicts from subtasks and improved error messages in validators:
@maciejmajek we spoke about setting the recursion limit to the number of tool calls required in the task, but 1 step is just one node execution, not a tool call, so restricting it like this does not make much sense (tool calls are in AIMessages). I've set the recursion limit to 4*(required_tool_calls+extra_tool_calls) and changed agent.invoke to stream, which lets us collect messages from the agent even when the recursion limit occurs: 0cddef4
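The invoke-to-stream change described above can be illustrated with a simulated agent. The streaming agent here is a stand-in; with a real LangGraph agent you would iterate over `agent.stream(...)` with a `recursion_limit` in the config in a similar try/except pattern:

```python
# Sketch: stream instead of invoke, so messages gathered before a
# recursion-limit error are not lost. `fake_stream` simulates an agent
# that raises once its step limit is exceeded.

class RecursionLimitError(Exception):
    pass


def fake_stream(limit):
    for i in range(100):
        if i >= limit:
            raise RecursionLimitError("recursion limit reached")
        yield {"messages": [f"msg-{i}"]}


def run_and_collect(required_tool_calls, extra_tool_calls):
    # limit formula as described in the PR comment
    limit = 4 * (required_tool_calls + extra_tool_calls)
    messages = []
    try:
        for chunk in fake_stream(limit):
            messages.extend(chunk["messages"])
    except RecursionLimitError:
        pass  # keep whatever was collected before the limit was hit
    return messages


msgs = run_and_collect(required_tool_calls=1, extra_tool_calls=0)
# 4 * (1 + 0) = 4 chunks are collected before the simulated limit triggers
```

With `invoke` the exception would discard everything; with streaming, the partial message list survives and can still be validated and scored.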
Co-authored-by: Magdalena Kotynia <magdalena.kotynia@robotec.ai>
fixes to subtasks import, fix to old tests
fixes to validators
deleted unused code
adjusted error messages in validators
gathering tool calls even when recursion limit occurs
refactor running benchmarks to take args, added model initialization via model name
```python
def get_llm_model_direct(
    model_name: str,
    vendor: str,
    config_path: Optional[str] = None,
    **kwargs: Any,
) -> ChatOpenAI | ChatBedrock | ChatOllama:
    config = load_config(config_path)
    model_config = getattr(config, vendor)

    logger.info(f"Initializing Model: {model_name}, Vendor: {vendor}")
    if vendor == "openai":
        from langchain_openai import ChatOpenAI

        model_config = cast(OpenAIConfig, model_config)

        return ChatOpenAI(model=model_name, base_url=model_config.base_url, **kwargs)
    elif vendor == "aws":
        from langchain_aws import ChatBedrock

        model_config = cast(AWSConfig, model_config)

        return ChatBedrock(
            model_id=model_name,
            region_name=model_config.region_name,
            **kwargs,
        )
    elif vendor == "ollama":
        from langchain_ollama import ChatOllama

        model_config = cast(OllamaConfig, model_config)
        return ChatOllama(model=model_name, base_url=model_config.base_url, **kwargs)
    else:
        raise ValueError(f"Unknown LLM vendor: {vendor}")
```
This could be a factory; apart from that, LGTM.
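The factory shape suggested in the review could look roughly like this: a registry mapping vendor name to a constructor, replacing the if/elif chain. The dict-returning functions below are stand-ins for the real `ChatOpenAI`/`ChatBedrock`/`ChatOllama` constructors, so this is a sketch of the pattern only:

```python
# Sketch of a vendor factory registry replacing the if/elif dispatch.

from typing import Any, Callable, Dict

_MODEL_FACTORIES: Dict[str, Callable[..., Any]] = {}


def register_vendor(name: str):
    """Decorator registering a constructor for a vendor name."""
    def wrap(fn):
        _MODEL_FACTORIES[name] = fn
        return fn
    return wrap


@register_vendor("openai")
def _make_openai(model_name: str, **kwargs: Any):
    # placeholder for: return ChatOpenAI(model=model_name, ...)
    return {"cls": "ChatOpenAI", "model": model_name, **kwargs}


@register_vendor("ollama")
def _make_ollama(model_name: str, **kwargs: Any):
    # placeholder for: return ChatOllama(model=model_name, ...)
    return {"cls": "ChatOllama", "model": model_name, **kwargs}


def get_llm_model_direct(model_name: str, vendor: str, **kwargs: Any):
    try:
        factory = _MODEL_FACTORIES[vendor]
    except KeyError:
        raise ValueError(f"Unknown LLM vendor: {vendor}") from None
    return factory(model_name, **kwargs)
```

Adding a new vendor then only requires registering one function, with no change to the dispatch code.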

Purpose
Unify folder structure and naming of benchmarks
Refactor the tool_calling bench, as there are conflicts between branches:
https://github.com/RobotecAI/rai/tree/mk/feat/spatial-reasoning-tasks
https://github.com/RobotecAI/rai/tree/jm/feat/tool-benchmark-custom-interfaces
which both changed the tool calling benchmark and were done in a hurry. That resulted in a lot of conflicts and not the best code.
Merge and unify tasks that are already on development with the spatial, navigation and custom interfaces tasks.
Improve and unify logging and saving results
This PR is big, so please follow the change descriptions below. Focus first on the 1st and 2nd points. The new frame is essential in this PR: it dictates how validation is executed, and a lot of other changes in this PR are adjusted to this frame.
Related PRs and branches will also be closed when this PR is merged.
Proposed Changes
- all naming unified; the benchmarks are now `manipulation_o3de` and `tool_calling_agent`, and every folder related to a benchmark will have exactly that name
- all code providing the framework for creating a benchmark in the adequate folder
- all code related to a specific benchmark implementation in the `examples/` folder
- experiment logs and results in the `experiments/` folder
- files/folders responsible for interfaces, tasks, benchmarks etc. named the same across benchmarks
REST OF THE POINTS ARE FOR TOOL CALLING AGENT BENCHMARK:
2. New frame for tool_calling_agent benchmark:
See `examples/tool_calling_agent/tasks.py` for more intuition. Every Task always has the same prompt and available tools; only the validation methods can be parametrized. On top of validators, you can pass the `extra_tool_calls` param to allow the model to correct itself.
tasks
models
Unit tests for subtasks and validators
New GetInterfaceTool: the old version didn't return the types of fields.
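To illustrate why returning field types matters: the agent needs each field's type, not just its name, to construct valid messages. The parser below is a hypothetical sketch over a simplified interface definition, not the actual GetInterfaceTool implementation:

```python
# Hypothetical sketch: parse lines like "float64 x" from an interface
# definition into a field-name -> type mapping, so the agent sees types
# instead of bare field names.

def interface_fields(definition: str) -> dict:
    fields = {}
    for line in definition.strip().splitlines():
        parts = line.split()
        if len(parts) == 2:
            type_name, field_name = parts
            fields[field_name] = type_name
    return fields


fields = interface_fields("float64 x\nfloat64 y\nstring frame_id")
# maps each field name to its declared type
```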
structure of results:
- results now have a `validators` list that shows what is expected by every validator
- followed by a `passed` list, which holds a bool for every validator
- followed by `score` (which is redundant with `passed`, but it's weird to not have a score in results; if you have ideas here, please share)
- followed by `errors`, which is a list of lists, where every validator has its own list of errors
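The result structure described above might serialize roughly like this. The field names follow the description; the exact schema and example values are assumptions:

```python
# Sketch of a single task's result record: per-validator expectations,
# a parallel passed list, a redundant score, and per-validator error lists.

result = {
    "validators": [
        {"expected": ["get_interface"]},
        {"expected": ["publish_message"]},
    ],
    "passed": [True, False],
    "score": 0.5,  # fraction of validators that passed
    "errors": [
        [],  # first validator passed, no errors
        ["Expected argument 'topic' should have value '/cmd_vel', but got '/odom'"],
    ],
}

# score is recomputable from passed, which is why it is redundant:
score = sum(result["passed"]) / len(result["passed"])
```

Since `score` is derivable from `passed`, one option is to keep it only as a convenience column in saved results rather than a separate source of truth.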
Args passed when running the tool calling agent benchmark
The user can pass 2 args when running the benchmark: model_name and vendor
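A minimal sketch of those two CLI args, assuming argparse; the actual runner may parse them differently, and the flag spellings here are assumptions:

```python
# Hypothetical CLI for the benchmark runner: model_name and vendor only.

import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Run the tool_calling_agent benchmark"
    )
    parser.add_argument("--model-name", required=True)
    parser.add_argument(
        "--vendor", required=True, choices=["openai", "aws", "ollama"]
    )
    return parser.parse_args(argv)


args = parse_args(["--model-name", "gpt-4o", "--vendor", "openai"])
```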
Small docs addition, as I wanted to paste an image. The docs are small for now, but I guess there will be docs for benchmarks in the future anyway.
Script to test multiple models, different benchmarks, or a couple of repeats in one go: https://github.com/RobotecAI/rai/blob/jm/refactor/rai_bench/src/rai_bench/rai_bench/examples/test_models.py
Issues
#515
related PRs and branches:
#493
#487
mk/feat/tool-calling-bench-navigation-tasks
Testing
Test single
tests:
script running benchmarks:
next steps